Abstract

Background: Diarrheal diseases continue to contribute significantly to morbidity and mortality in infants and 
young children in developing countries. There is an urgent need to better understand the contributions of novel, 
potentially uncultured, diarrheal pathogens to severe diarrheal disease, as well as distortions in normal gut 
microbiota composition that might facilitate severe disease. 

Results: We use high throughput 16S rRNA gene sequencing to compare fecal microbiota composition in children 
under five years of age who have been diagnosed with moderate to severe diarrhea (MSD) with the microbiota 
from diarrhea-free controls. Our study includes 992 children from four low-income countries in West and East Africa 
and South Asia. Known pathogens, as well as bacteria currently not considered important diarrhea-causing 
pathogens, are positively associated with MSD; these include Escherichia/Shigella, Granulicatella species, and 
Streptococcus mitis/pneumoniae groups. In both cases and controls, there tend to be distinct negative correlations 
between facultative anaerobic lineages and obligate anaerobic lineages. Overall genus-level microbiota composition 
exhibits a shift in controls from low to high levels of Prevotella and in MSD cases from high to low levels of 
Escherichia/Shigella in younger versus older children; however, there is significant variation among many genera 
by both site and age. 

Conclusions: Our findings expand the current understanding of microbiota-associated diarrhea pathogenicity in 
young children from developing countries. Our findings are necessarily based on correlative analyses and must be 
further validated through epidemiological and molecular techniques. 



Background 

Diarrheal diseases continue to be major causes of child- 
hood mortality, ranking among the top four largest con- 
tributors to years of life lost in sub-Saharan Africa and 
South Asia [1]. The proportion of deaths attributed to 
diarrhea among children aged under 5 years is estimated 
to be approximately 15% worldwide [2], and as high as 
approximately 25% in Africa and 31% in South East Asia 
[3]. More than two dozen enteric pathogens, belonging 
to diverse branches of the tree of life, are known to 
cause diarrhea and can be tested for in a clinical setting. 
However, it is likely that additional pathogens remain to 
be identified among the enteric microbiota. 

In response to important unanswered questions sur- 
rounding the burden and etiology of childhood diarrhea 
in developing countries, the Bill and Melinda Gates 
Foundation commissioned the Global Enterics Multicenter 
Study (GEMS) [4], which recently reported the pathogens 
responsible for cases of moderate-to-severe diarrhea 
(MSD) in seven impoverished countries of sub-Saharan 
Africa and south Asia. Importantly, for approximately 60% 
of MSD cases in GEMS, no known pathogen could be im- 
plicated by conventional diagnostic methods [5]. These 
observations highlight the potential presence of previously 
undiscovered pathogens, and/or possible interactions be- 
tween pathogens and other members of the intestinal 
microbiota (both pathogenic and commensal) that may ei- 
ther exacerbate the clinical manifestation or protect the 
host from disease. 

Here we apply molecular techniques to survey the intes- 
tinal microbiota in a subset of GEMS cases and controls. 
Our study comprises 992 children from four under- 
developed countries in West Africa (The Gambia and 
Mali), East Africa (Kenya), and South Asia (Bangladesh), 
representing a subset of the over 25,000 GEMS children 
enrolled. Our results shed additional light on potential 
mechanisms underlying MSD in children of developing 
countries. Prior to presenting these results we would like 
to stress that our analyses are, by necessity, correlative 
and the results presented here must be validated through 
epidemiological and molecular analyses, several of which 
are already underway. 

Results and discussion 

Description of data 

Our data comprise roughly equal proportions of cases 
and controls (0.51 vs. 0.49, respectively) from four sites: 
Bangladesh (N = 206), The Gambia (N = 269), Kenya (N = 
305), and Mali (N = 212). Approximately 55% of the sub- 
jects were boys. Of 992 samples, 508 were from patients 
with MSD (Table 1). The children ranged in age from 
newborn to 59 months. We stratified them into five age 
categories: 0 to 5 months (N = 112), 6 to 11 months (N = 
308), 12 to 17 months (N = 173), 18 to 23 months (N = 
146), and 24 to 59 months (N = 253). There were no 
significant differences between the proportion of cases 
and controls in each country and from each age group 
(Table 1). The sequencing of PCR amplified 16S rRNA 
genes resulted in 3,584,096 reads passing quality checks. 
Each sample had at least 1,000 reads, and there were an 
average of 3,613 reads per sample. The reads were clus- 
tered using DNAclust [6] into 97,666 operational taxo- 
nomic units (OTUs). Of these, 21,247 passed chimera 
checking, were detected in more than five samples, or rep- 
resented at least 20 sequences in a single sample, and were 
included in further analysis. The number of OTUs per 
sample ranged from 55 to 1,252, with a median of 380 and 
an average of 412. The mean OTU size was 138, ranging 
from 5 (by definition) to 192,978 (with median OTU 
size = 15 sequences). Representative sequences from 
the 21,247 OTUs matched 728 distinct taxa from 161 
genera. Among these, 4,730 (22%) did not have good 
(>100 bp exact match, >97% identity) matches to isolate 



Table 1 Demographics of the children 

Demographic characteristics for samples (N = 992), N (%) 

                        Cases        Controls     Total        P value
                        N = 508      N = 484      N = 992
Age groups by months                                           0.1788
  0 to 5                58 (11)      54 (11)      112 (11)
  6 to 11               171 (34)     137 (28)     308 (31)
  12 to 17              93 (18)      80 (17)      173 (17)
  18 to 23              70 (14)      76 (16)      146 (15)
  24 to 59              116 (23)     137 (28)     253 (26)
Country                                                        0.3622
  The Gambia            138 (27)     131 (27)     269 (27)
  Mali                  110 (22)     102 (21)     212 (21)
  Kenya                 165 (32)     140 (29)     305 (31)
  Bangladesh            95 (19)      111 (23)     206 (21)
Gender                                                         0.5785
  Male                  286 (56)     264 (54)     550 (55)
  Female                222 (44)     220 (46)     442 (45)
Dysenteric stools       140 (28)     7 (1)        147 (15)     <10^-16

All ages are in months. P values test independence of MSD cases and controls 
with regards to demographic variable. P values for age in months (treated as a 
continuous variable) computed by independent samples t-test. P values for 
categorical variables calculated using chi-square test. 
MSD: Moderate-to-severe diarrhea. 



sequences from the Ribosomal Database Project (RDP). 
These were flagged as 'unassigned' in our analysis and are 
discussed further below. These sequences are not simply 
an artifact of our stringent alignment criteria, as evidenced 
by the fact that a re-analysis of the 6,879 most abundant 
OTUs using the reference-based OTU picking algorithm 
implemented in QIIME [7] failed to classify a similar 
proportion of sequences (2,162 or 31% of the abundant 
OTUs). 
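
The chi-square P values reported in Table 1 for the categorical variables can be checked directly from the published counts. The following is a minimal sketch, not the authors' code: it assumes an uncorrected Pearson chi-square test of independence, the function names are illustrative, and the upper-tail probability is computed with a closed-form recurrence that is valid only for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (integer df only)."""
    half = x / 2.0
    if df % 2:                      # odd df: start from the df = 1 tail
        q, k = math.erfc(math.sqrt(half)), 1
    else:                           # even df: start from the df = 2 tail
        q, k = math.exp(-half), 2
    while k < df:                   # step up two degrees of freedom at a time
        k += 2
        q += half ** (k / 2 - 1) * math.exp(-half) / math.gamma(k / 2)
    return q

def chi2_independence(table):
    """Pearson chi-square test (no continuity correction) on a count table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(col_tot) - 1)
    return stat, df, chi2_sf(stat, df)

# Counts from Table 1: first row is cases, second row is controls.
country = [[138, 110, 165, 95], [131, 102, 140, 111]]  # Gambia, Mali, Kenya, Bangladesh
gender = [[286, 222], [264, 220]]                      # male, female

for name, tab in (("country", country), ("gender", gender)):
    stat, df, p = chi2_independence(tab)
    print(f"{name}: chi2 = {stat:.3f}, df = {df}, P = {p:.4f}")
```

The printed P values can be compared with the Country (0.3622) and Gender (0.5785) rows of Table 1. The age-group P value is not reproduced here because the table footnote states that age was treated as a continuous variable and tested with a t-test.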

Microbiota variations by age 

The well documented [8-10] succession of the intestinal 
microbiota during child development is apparent in our 
non-diarrheal control samples (Figure 1A). During the 
first year of life, the 'healthy' gut microbiota in our infant 
cohorts is characterized by comparatively low overall di- 
versity and a relatively high proportion of facultatively 
anaerobic, and potentially pathogenic, organisms (for ex- 
ample, the Escherichia/Shigella group, which cannot be 
distinguished from each other by 16S rRNA gene se- 
quences), organisms that are believed to play a role in 
the development of the host immune system [11,12]. In 
older ages, the dominance of these organisms is reduced, 
replaced by a corresponding increase in overall diversity 
(Figure 1B), accompanied by a particularly pronounced 
increase in the proportional abundance of the bacterial 

Figure 1 Comparison of diarrheal and non-diarrheal stool. (A) Proportional abundance of genera in non-diarrheal controls and MSD cases in 
different age categories. Each color represents a different group. The order and color for each group are the same for controls (patients without 
MSD) and MSD cases. The eight groups most frequently found in controls (Prevotella, Bacteroides, Escherichia/Shigella, Veillonella, Streptococcus, 
Lactobacillus, Faecalibacterium, Megasphaera, plus unassigned and other) are depicted. (B) Shannon diversity index across ages and diarrheal 
status. Average Shannon diversity indices for the five different age strata as well as the corresponding 95% confidence intervals. Both cases and 
controls exhibited higher mean Shannon diversity index scores at higher age groups compared to lower age groups (P <0.001, one-way ANOVA). 
The diversity of healthy samples positively correlates with age in the first 2 years of life, as previously reported [12]. The diversity index for cases is 
significantly less than that for controls within each country (P <0.02, Tukey's t-test corrected for multiple comparisons). Also see Additional file 2: 
Table S2 and Additional file 3: Figure S1. 



genus Prevotella. These changes are most evident in our 
non-diarrheal control samples, where the genus Prevo- 
tella increases from approximately 12% to approximately 
48% proportional abundance during the first 5 years of 
life, while the Escherichia genus drops from about 20% 
proportional abundance in infants under 6 months of 
age to approximately 1% in 2- to 5-year-olds (Additional 
file 1: Table S1). Two other genera, Veillonella and Streptococcus, 
also exhibit significant decreases with increasing age. 
Our data also show an increase with increasing 
age in the proportion of a range of organisms (labeled 
'unassigned' in Figure 1A and Additional file 1: Table S1) 
that have no good quality matches to cultured isolates in 
public databases, and which appear to belong predomi- 
nantly to obligate anaerobic bacteria (over 60% can be 
assigned by the RDP classifier to the Ruminococcaceae 
and Lachnospiraceae families of the Firmicutes phylum, 
which are relatively poorly represented in culture collec- 
tions [13], as well as the Bacteroidaceae family). These 
previously-uncultured putative obligate anaerobes in- 
crease in proportional abundance from approximately 
8% in diarrhea-free young children to approximately 
23% in the older age group, consistent with the increase in 
diversity within the intestinal microbiota and the known 
expansion of these groups, which are able to colonize 
the intestine in greater numbers as the complex poly- 
saccharides they utilize for growth become a greater 
feature of the host diet [14]. 

These observations broadly hold when stratifying by 
country of origin; however, country-specific effects are 
also apparent. For example, the samples from Bangladesh 
are different from the African countries, particularly in 
the younger age groups, and are characterized by a lower 
proportion of Prevotella sequences and a higher proportion 
of organisms from the Escherichia/Shigella and Streptococcus 
genera (Figure 1A and Additional file 1: Table S1). 

The patterns observed within control samples were 
significantly different from patterns from patients with 
MSD; however, some overall age-related trends were 
similar. For example, Prevotella abundance correlates 
with age, albeit reaching a much lower peak, with only 
23% abundance in the oldest age group (vs. 48% in controls, 
P <10^-16). Other obligate anaerobic microbes have 
lower proportional abundance among cases compared 
to controls: Bacteroides and the unclassified putative 
anaerobes are both 5% lower in cases, consistent with 
previous observations that indicate intestinal dysbiosis is 
associated with a decrease in the proportional abun- 
dance of obligate anaerobes [15]. Among cases, Escheri- 
chia/Shigella and Streptococcus spp. maintain a high 
proportion across all age groups, though their prepon- 
derance drops significantly (41% to 13% and 18.5% to 
7.5%, respectively) as children age. Furthermore, it appears 
that Prevotella and Escherichia/Shigella are negatively 
correlated in MSD cases (Spearman rho = -0.55, 
P <0.0001). The disruption associated with diarrhea is 
also reflected in lower diversity values in MSD cases in 
every age group (Figure 1B, Additional file 2: Table S2, 
Additional file 3: Figures S1A-D). 
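
The Spearman correlation used above is the Pearson correlation of the ranks of the two variables. As a minimal sketch (the abundance values below are hypothetical and chosen only for illustration, and the helper names are not from the study), it can be computed from per-sample proportional abundances as follows:

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1       # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical proportional abundances in five stool samples.
prevotella = [0.02, 0.10, 0.25, 0.40, 0.55]
escherichia_shigella = [0.45, 0.30, 0.20, 0.08, 0.03]
print(spearman_rho(prevotella, escherichia_shigella))
```

Because the statistic depends only on ranks, it is insensitive to the strongly skewed abundance distributions typical of 16S rRNA survey data.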

Country-specific effects were also observed in diar- 
rheal stool; for instance, in Kenya, diarrhea appeared to 
have a less marked effect on the microbiota (Figure 1A 
and Additional file 1: Table S1). Escherichia/Shigella spp. 
were most common in Mali, accounting for 34% of the 
sequences, next most common in Bangladesh (24%) and 
least common in The Gambia (15%). Prevotella spp. 
were found in high proportional abundances in The 
Gambia (18%) and Kenya (19%). The genus Streptococ- 
cus is found in relatively high abundances in Bangladesh 
(21%) and The Gambia (13%) with lower abundances in 
Mali (10%) and Kenya (9%). As expected, the taxonomic 
diversity (Shannon diversity index) is significantly differ- 
ent between cases and controls in all countries (P <0.005, 
pairwise t-test). Of note, where Prevotella is more common 
(The Gambia and Kenya), the diversity is higher 
(Figure 1B). 
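
For reference, the Shannon diversity index of a sample is H = -sum(p_i * ln p_i), where p_i is the proportion of reads assigned to OTU i. The sketch below assumes the natural logarithm (the text does not state the base used) and takes a plain list of per-OTU read counts for one sample; the function name and the example counts are illustrative only.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical samples: one dominated by a single OTU, one evenly spread.
dominated = [970, 10, 10, 10]
even = [250, 250, 250, 250]
print(shannon_index(dominated), shannon_index(even))
```

A sample dominated by one OTU scores close to zero, while an even spread over k OTUs reaches the maximum of ln(k), which is why a bloom of a single facultative anaerobe lowers the index.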



Taxonomic groups statistically increased or decreased in 
diarrhea 

Multidimensional scaling analysis could not separate 
the diarrhea and diarrhea-free bacterial communities 
due to high inter-personal variation (Additional file 3: 
Figure S3). We estimated the association of individual 
OTUs with disease using statistical tests addressing 
both presence-absence statistics (Fisher's exact test and 
logistic regression) and abundance-dependent statistics 
(using generalized linear models) that account for the 
number of OTU-specific sequences in each stool, and 
potential confounders such as sampling depth, age, and 
country (see Additional file 4: Table S3 for a full sum- 
mary). The former address similar questions to those 
commonly targeted by the traditional culture-based epi- 
demiological studies, while the latter allow us to assess 
how pathogen proportional abundance correlates with 
morbidity. 
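
To make the presence-absence approach concrete, the sketch below implements a two-sided Fisher's exact test on a 2x2 table (feature present or absent, in cases or controls). It is an illustration under stated assumptions rather than the pipeline used in the study: the function name is hypothetical, and the two-sided P value is defined here as the sum of the probabilities of all tables that are no more likely than the observed one. Since per-OTU counts are not given in the text, the usage line applies the test to the dysenteric-stool row of Table 1.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: cases with the feature      b: cases without it
    c: controls with the feature   d: controls without it
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x feature-positive cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at least as extreme (no more probable) than observed.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Table 1: dysenteric stools in 140 of 508 cases and 7 of 484 controls.
print(fisher_exact_two_sided(140, 508 - 140, 7, 484 - 7))
```

A test of this kind ignores how many sequences of an OTU are present in each stool, which is the information the abundance-dependent models add.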

Ten OTUs were found to be positively associated with 
diarrhea by all statistical tests. The OTUs associated 
with MSD have high-similarity matches against database 
sequences from bacterial taxa in the Escherichia/Shigella, 
Granulicatella spp., and Streptococcus mitis/pneumoniae 
groups. When only abundance-dependent statistics are 
used to determine significance, an additional 18 OTUs are 
found to be highly associated with diarrhea, corresponding 
to the bacterial species Escherichia/Shigella, Campylo- 
bacter jejuni, and Streptococcus pasteurianus. When only 
considering presence/absence statistics, 43 additional 
OTUs are found to be associated with diarrhea, com- 
prising the bacterial groups already discussed above 
as well as members of the genera Lactobacillus, Neisseria, 
Citrobacter, Erwinia, and Haemophilus. It is noteworthy 
that all of these organisms are either facultatively anaer- 
obic or microaerophilic. 

On the other hand, there were no OTUs positively as- 
sociated with healthy stools by both statistical methods, 
reflecting the higher degree of inter-individual variation 
in microbiota content in healthy individuals. Consi- 
dering only presence/absence statistics, there are 43 
OTUs associated with non-diarrheal control samples. 
The genera associated with these control samples in- 
clude members of the clostridial families Peptostrepto- 
coccaceae, Eubacteriaceae, and Erysipelotrichaceae, and 
the genera Clostridium sensu stricto, Dialister, Entero- 
coccus, Prevotella, Ruminococcus, and Turicibacter. When 
considering only abundance statistics, an additional 19 
OTUs are significantly associated with non-diarrhea sam- 
ples and have high quality matches to database sequences 
corresponding to Bacteroides fragilis, Dialister, Mega- 
sphaera, Mitsuokella/Selenomonas, Prevotella spp., and 
Clostridium difficile. Thus, it can be seen that many obli- 
gate anaerobic bacterial lineages correlate with healthy 
status. 

Functional differences between cases and controls 

The broad statements made above about oxygen tolerance 
in the diseased microbiota are supported by PICRUSt 
[16] analyses of our data. Specifically, these analyses showed 
putative signatures of obligate anaerobic gut lineages to be 
enriched in the diarrhea-free samples (for example, glycolysis, 
P = 10^-9; pyruvate metabolism, P = 10^-7; short chain 
fatty acid biosynthesis, P = 10^-3; xylene degradation, 
P = 10^-7; and so on; all P values by Welch's t-test as computed 
by STAMP [17]), while oxygen-dependent pathways (for 
example, the TCA cycle, P <10^-15) are enriched in diseased 
samples. 
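
Welch's t-test compares two group means without assuming equal variances. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom; it does not reproduce STAMP or the P values above, the function name is illustrative, and the pathway abundances are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny   # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Hypothetical predicted relative abundances of one pathway per sample.
controls = [0.031, 0.028, 0.035, 0.030, 0.033]
cases = [0.022, 0.027, 0.019, 0.024]
print(welch_t(controls, cases))
```

A P value would be obtained by referring t to a Student t distribution with df degrees of freedom, which requires a statistics library or numerical integration and is omitted here.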

Taxonomic groups correlated with dysentery 

We segregated diarrheal stool based on diagnosis of dys- 
entery (presence of blood) and found a total of 30 OTUs 
that were strongly correlated with dysentery when comparing 
with non-dysentery diarrheal stool (metagenomeSeq 
[18], P <0.05). These include several well-known 
pathogens such as Enterococcus faecalis, Campylobacter 
jejuni, Bacteroides fragilis, Clostridium perfringens, 
Enterobacter cancerogenus, and members of the Granulicatella, 
Haemophilus, Klebsiella, and Escherichia/Shigella 
genera. Also associated with dysentery were members of 
the Streptococcus pasteurianus and Streptococcus sali- 
varius groups. A single OTU, corresponding to Lacto- 
bacillus ruminis, was found to be negatively associated 
with dysentery. A genus-level representation of these 
findings is shown in Figure 2. 

Network view of diarrheal illness 

The overall results presented above are also borne out 
in correlation networks constructed from the data 
(Additional file 3: Figure S7). At the broad level, in both 
MSD cases and controls, it can be seen that there tend to 
be negative correlations between facultative anaerobic lin- 
eages and obligate anaerobic lineages. The most obvious 
example is the negative correlation of the potentially pro- 
tective Prevotella genus with that of potential pathogens 
such as Escherichia/Shigella. Similarly, there are also 
positive correlations within these two phenotypic sub- 
groupings, such that obligate anaerobic genera such as 
Prevotella, Roseburia, and Dialister are correlated with 
each other, while facultative anaerobic or microaerophilic 
genera such as Streptococcus, Lactobacillus, Escherichia/ 
Shigella, and other Proteobacteria are also correlated with 
each other. The diarrhea-free network appears to be more 
tightly connected than the diarrheal network, consistent 
with ecological theories that equate environment diversity 
and connectedness with ecosystem stability/health [19,20]. 
At the same time, we would like to note that our data do 
not allow a reliable quantitative assessment of such phe- 
nomena due to the large level of inter-personal variation. 



Discussion 

Our analysis of the 16S rRNA gene-based taxonomic 
profile of diarrheal and control stool samples has dem- 
onstrated a strong association between acute diarrheal 
disease and the overall taxonomic composition of the 
stool microbiota in young children from the developing 
world. We have identified statistically significant disease 
associations with several organisms already implicated in 
diarrheal disease, such as members of the Escherichia/ 
Shigella genus and C. jejuni. In addition, we have un- 
covered an association with diarrheal disease for several 
organisms not widely believed to cause this disease, such 
as Streptococcus and Granulicatella. Streptococcal OTUs 
associated with disease primarily belong to either the 
Streptococcus pneumoniae/mitis group (indistinguishable 
within the 16S rRNA gene regions targeted by our study), 
which contains several important human pathogens, or 
the Streptococcus pasteurianus group. These results merit 
further exploration as recent studies provide evidence of 
Streptococcus-related diarrheal cases [21,22]. It is important 
to stress that pathogenicity is only one of many possible 
explanations for these findings. The organisms 
associated with disease status may instead: (1) usually 
inhabit the upper GI tract and become apparent in diarrheal 
stool due to dislodging and reduced transit time during 
disease; (2) thrive in disturbed gut environments; (3) 
be better able to persist/resist dislodgement during a 
diarrheal purge; or (4) cause disease in these children only 
in combination with other pathogens [23]. Prior evidence certainly 
suggests that facultative anaerobes (many of which 
we find associated with diarrhea) tend to flourish in a var- 
iety of perturbed gut environments, possibly because the 
reducing power of the microbiota is affected by the loss of 
obligate anaerobes following perturbation [15]. Any caus- 
ality would need to be demonstrated through further ex- 
perimentation. At the same time, streptococci are also 
found in our study to be associated with more severe 
forms of diarrhea (dysentery), thereby strengthening the 
case for a possible causal connection. Despite uncertainty 
regarding the causes and effects of microbiota pertur- 
bations in the setting of MSD, dissecting the physiologic 
implications is warranted. For example, an increase in 
streptococcal or other species in the setting of diarrhea 
may confer or exacerbate diarrheal effects. S. mutans has 
recently been postulated to have a role in human enteritis. 
Our work represents an important first step in under- 
standing the complex interaction between microbiota and 
diarrheal pathogens in developing country settings. 

Our study has also revealed a high prevalence of mem- 
bers of the Prevotella genus (primarily Prevotella copri) 
in the stool of developing world children, as well as the 
negative correlation of this genus with disease. These or- 
ganisms are prevalent in the developing world [14], yet 
are relatively poorly studied due to fairly low prevalence 

Figure 2 Comparison of dysenteric and non-dysenteric stool. (A) Genus-level comparison of dysenteric and non-dysenteric diarrheal stool 
(top) stratified by age; (bottom) stratified by country. (B) Proportional abundance boxplots of Prevotella, Lactobacillus, and Streptococcus in 
dysenteric and non-dysenteric diarrheal stools by age category. The upper whisker extends from the 75th percentile to the highest value that is 
within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. The lower whisker extends 
from the hinge to the lowest value within 1.5 * IQR of the hinge. Data beyond the end of the whiskers are outliers and are not plotted. Asterisks 
above the whisker indicate a statistically significant difference (by t-test) between dysenteric and non-dysenteric stools placed in the panel with 
the more abundant mean. A single asterisk indicates P <0.05; double asterisks indicate P <0.01. Prevotella is significantly associated with 
non-dysenteric cases overall (P = 0.0003) and in age groups 0 to 6 months (P = 0.01), 12 to 17 months (P = 0.03), and 24 to 59 months (P = 0.001). 
Lactobacillus is significantly associated with non-dysenteric cases overall (P = 0.0002) and in children 6 to 11 months (P = 0.02) and 12 to 17 months 
(P = 0.003), while the genus Streptococcus is associated with dysentery overall (P = 0.007), particularly in children aged 12 to 17 months (P = 0.01). 



in the industrialized world [24]. Samples containing high 
proportions of members of the Prevotella genus also 
have higher overall bacterial diversity, potentially driven 
by the level of complex polysaccharides/starchy fiber in 
the diet. Recent evidence suggests that Prevotella spp. 



are particularly abundant in rural African children con- 
suming a high fiber diet [25]. This is in stark contrast to 
Western children, who typically have much higher abun- 
dances of Bacteroides spp., and very little Prevotella, a 
difference that is believed to be linked to diet [26]. 
Our co-occurrence network analyses (Additional file 3: 
Figure S7) and proportional abundance analysis (Additional 
file 1: Table S1) suggest potential negative interactions 
between Prevotella and enteric pathogens, such as mem- 
bers of the Escherichia/Shigella genus, raising the pos- 
sibility for the development of novel Prevotella-based 
therapeutic strategies. Another possible probiotic organ- 
ism identified in our study is Lactobacillus ruminis. This 
organism was found to be associated with non-diarrheal 
stool and also with less severe forms of diarrhea when 
comparing diarrheal to dysenteric stool. Although the increase 
in frequency of these taxa in diarrhea could be due 
to shortened intestinal transit time, the difference in prevalence 
of Lactobacillus between cases of MSD and dysentery 
is less likely to represent this effect. Lactobacillus 
ruminis has immunomodulatory properties and has been 
previously suggested as a potential probiotic [27]. 

Among OTUs found associated with non-diarrheal 
stool are sequences classified as Clostridium difficile, a 
surprising finding given that this organism is a common 
cause of enteric disease, primarily in hospitalized elderly 
patients. However, although C. difficile can be an impor- 
tant pathogen, it is actually carried asymptomatically by 
around 60% of infants [28]. We also found a conflicting 
association of OTUs assigned as Bacteroides fragilis with 
both the diarrhea-free status and dysentery, a finding 
that can perhaps be explained by strain-to-strain var- 
iation. Enterotoxigenic B. fragilis strains are well cha- 
racterized diarrheal agents in children [29] whereas, in 
contrast, non-toxigenic B. fragilis has been linked to 
anti-inflammatory protective effects in mouse models 
[30]. It is therefore possible that different strains, which 
cannot be differentiated through 16S rRNA gene se- 
quencing, might account for these opposing results. 

Our study identified many sequences that do not have 
good matches against cultured organisms in current 16S 
rRNA gene databases. Many of these sequences only have 
high-quality matches to other uncultivated and uncharac- 
terized intestinal microbes, highlighting the presence of a 
large reservoir of uncharacterized microbes in the in- 
testinal tract of children within the developing world, as 
reported before [31]. Many of the unknown sequences ap- 
pear to belong to obligate anaerobic lineages of the Firmi- 
cutes phylum, which are under-represented in culture 
collections compared to other intestinal dwelling groups 
such as Bacteroides and bifidobacteria. The prevalence of 
such 'unknown' sequences is higher in controls and several 
of these uncharacterized organisms exhibit strong associa- 
tions with diarrhea-free samples, highlighting their poten- 
tial role in the maintenance of a healthy gut microbiota, 
and suggesting the need for a better in-depth characte- 
rization of the gut microbiota of children within the devel- 
oping world, complementing resources recently developed 
in Europe [32] and the US [33]. 



Our observations related to the microbial succession 
in the developing infant gut microbiota carry several ca- 
veats. A single sample was collected from each child at a 
single point in time, and we lack extensive data on prior 
history of diarrhea. While the data are suggestive of a 
progression in microbiota structure, monitoring of a 
birth cohort will be necessary to fully understand the 
progression of gut microbiota, and assess the impact of 
diarrhea (including, potentially, multiple episodes of 
diarrhea) on this process. At a technical level, we would 
also note that the primer sets used in this study (targeting 
the V1-V2 hypervariable regions of the 16S rRNA 
gene) do not effectively amplify bifidobacteria [34,35], 
known to be dominant members of the intestinal micro- 
biota of breast-fed infants, but this bias is likely to be 
uniform between cases and controls. We purposefully 
selected a primer set better targeted towards bacterial 
groups containing known and potential pathogens, such 
as Enterobacteriaceae, to improve our chances of detect- 
ing novel pathogens at the cost of obtaining less informa- 
tion about the already well-established early dominance 
by bifidobacteria. 

Our study revealed the limitations of existing molecu- 
lar and bioinformatics approaches employed in a clinical 
setting for performing taxonomic surveys of stool sam- 
ples. The use of the 16S rRNA gene, for example, does 
not afford a sufficient discrimination within taxonomic 
groups containing known or putative pathogens (Escher- 
ichia/Shigella, Streptococcus, and so on) indicating the 
pressing need for the development of new cost-effective 
and relatively unbiased molecular approaches [36] for in- 
creasing the resolution of epidemiological surveys such 
as ours. Relatedly, the accurate taxonomic assignment of 
sequences generated in studies such as ours is hampered 
by numerous errors in public databases and by the use 
of simplistic 'lowest common ancestor' heuristics by software 
tools faced with ambiguous taxonomic information. 
The results presented in this paper were obtained through 
the careful manual annotation of all the OTUs found 
to be associated with disease state (see Additional file 5: 
Table S4). Finally, we had to develop a novel statistical 
method [18] for identifying disease association in order to 
appropriately address data rarefaction as well as to control 
for the high inter-personal variability, a typical feature of 
the healthy gut microbiota [37], and other confounding 
factors. 

Conclusions 

Overall our study demonstrates that the major differ- 
ences in the microbiota between diarrheal and normal 
stools are quantitative differences in the proportions of 
the most prevalent taxa. Such quantitative differences 
were also observed in our previous qPCR-based study 
where we found that 80% (1,665/2,072) of controls and 
89% (1,307/1,461) of MSD cases had detectable levels of 
Shigella. Quantitative measurements of Shigella abun- 
dance were critical to assessing attributable risk [38]. 
Among the known causes of diarrhea (rotavirus, Shigella, 
Cryptosporidium, Enterotoxigenic E. coli, and so on) the 
attributable fraction of diarrhea in young children is esti- 
mated to be just 43% [5]. Our study provides initial evi- 
dence for the existence of novel pathogenic agents. The 
most likely candidates from our study are members of the 
Enterobacteriaceae and streptococci, taxa which already 
contain many known human pathogens. Further explor- 
ation of these organisms is necessary to better understand 
their pathogenic potential and the likelihood of their 
emergence as major pathogens through the acquisition of 
additional pathogenicity factors. Importantly, our study re- 
veals a possible protective role against diarrhea for the 
Prevotella genus and Lactobacillus ruminis. Understanding 
such effects is important. For example, microbiological 
[39] or dietary [26] interventions may be possible in the 
supportive treatment of diarrhea in children similar to ap- 
proaches used in the management of enteric infections 
in adults [39-41]. Further genomic and epidemiologi- 
cal studies are necessary to better characterize this ge- 
nus and to assess the potential development of diet- or 
microbiological-based therapeutics. 

Materials and methods 

Study design and participants 

Stool samples were selected from a large case/control 
study of moderate-to-severe diarrhea in children aged 
under 5 years [42]. Cases were enrolled upon presenta- 
tion to a health clinic reporting MSD. MSD eligibility 
criteria included sunken eyes, loss of normal skin turgor, 
a decision to initiate intravenous hydration or to hos- 
pitalize the child, or the presence of blood in the stool. 
Controls were sought following case enrollment, sam- 
pled from a demographic surveillance database of the 
area. Individuals were excluded if they were unable to 
produce a sufficient amount of stool volume for testing 
or they were unable or unwilling to consent to involvement
in the study. Consent was obtained for every participant prior
to collection of their stool and their data; it was
given by the caregiver (usually the mother) because all
patients were children aged less than 5 years. All samples
were collected between March of 2008 and June of 2009. 
One sample was collected for each child and no time- 
series analyses were conducted. The Institutional Review 
Boards (IRBs) at all cooperating institutions have reviewed 
and approved the protocol. The IRB Federal Wide Assur- 
ance numbers for all the sites are as follows: University of 
Maryland Baltimore FWA00007145, The Gambia, Medical 
Research Council Labs FWA 00006873, Kenya Medical 
Research Institute FWA 00002066, University of Mali Faculty
of Medicine Pharmacy and Dentistry FWA 00001769,
and International Centre for Diarrhoeal Disease Research,
Bangladesh FWA 00001468. Further details on study design
are described by Kotloff et al. [42].

Microbiology methods 

Stool specimens were collected in sterile containers and 
examined within 24 h. Stools were stored at 2 to 8°C 
while in transit to the laboratory. Each fresh stool speci- 
men was aliquoted into multiple tubes. All samples were 
analyzed by traditional microbiological tests for known 
bacterial, viral, and eukaryotic pathogens. Details of 
these methods can be found in Panchalingam et al. [43].
DNA was isolated using a bead beater with 3 mm diameter
solid glass beads (Sigma Life Science), and subsequently
0.1 mm zirconium beads (BIO-SPEC Inc.) to
disrupt cells. The cell slurry was then centrifuged at 
16,000 g for 1 min, the supernatant removed and proc- 
essed using the Qiagen QIAamp® DNA stool extraction 
kit. Extracted DNA was precipitated with 3 M sodium 
acetate and ethanol and the DNA shipped to the USA. 

Amplification and sequencing 

DNA was amplified using 'universal' primers targeting
the V1-V2 region of the 16S rRNA gene (small subunit
of the ribosome) in bacteria: 338R (5'-CATGCTGCCTCCCGTAGGAGT-3')
and 27F (5'-AGAGTTTGATCCTGGCTCAG-3'). Both forward and reverse primers
had a 5' portion specific for use with 454 FLX sequen- 
cing technology and the forward primers contained a 
barcode between the FLX and gene specific region, so 
that samples could be pooled to a multiplex level of 96 
samples per instrument run (see Additional file 6: Table S5 
for barcode information). 

Data availability 

Sequencing data and sample metadata are available at 
the NCBI archive under project PRJNA234437. 

Source code and documentation for the analysis pipe- 
line are available at GitHub: [44]. 

Abundance table and metadata are available, in BIOM 
[45] format, at [46]. 

Additional information on the study as well as links to 
all resources outlined above are made available at [47]. 

Analysis pipeline 

The individual reads were filtered for quality using cus- 
tom in-house scripts that perform the following checks 
suggested in Huse et al. [48]: (1) sequences containing
any ambiguity codes (N) were removed; (2) sequences that
were shorter than 75 cycles of the 454 instrument were
removed (each cycle yields an average of 2.5 bp, depending
on the sequence composition); (3) sequences for
which a barcode could not be identified were removed.
These checks are similar to those that can be performed
by Mothur [49]. The high quality sequences were separated
into 992 sample-specific sets according to the multiplexing
barcodes. Conservative OTUs were clustered
using DNAclust [6] with parameters (-r 1) (99% identity 
radius) thus ensuring that the definition of an OTU is 
consistent across all samples. To obtain taxonomic identification,
a representative sequence from each OTU was
aligned to the Ribosomal Database Project (RDP) [50] (rdp.cme.msu.edu,
release 10.4) using blastn with a long word
length (-W 100) in order to only detect nearly identical 
sequences. Sequences without a nearly identical match 
to RDP (>100 bp perfect match and >97% identity, as 
defined by BLAST) were marked as being 'unassigned' 
and assigned an OTU identifier. The resulting data were 
organized into a collection of tables at several taxo- 
nomic levels containing each taxonomic group as a row 
and each sample as a column. 
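
The three read-level checks and the barcode-based separation described above can be sketched as follows. This is a minimal illustration, not the in-house scripts used in the study: the function name, the assumption that each read begins with its barcode, and the expression of the length cutoff in base pairs (rather than instrument cycles) are all assumptions made for the example.

```python
from collections import defaultdict


def filter_and_demultiplex(reads, barcodes, min_length):
    """Apply the three quality checks and group surviving reads by sample.

    reads:      iterable of (read_id, sequence) pairs.
    barcodes:   dict mapping barcode sequence -> sample name (hypothetical layout;
                assumes the barcode sits at the very start of the read).
    min_length: minimum read length in bp (assumption: the paper's cutoff is
                stated in 454 cycles, so the caller must convert it).
    """
    by_sample = defaultdict(list)
    for read_id, seq in reads:
        seq = seq.upper()
        # Check 1: discard reads containing ambiguity codes.
        if "N" in seq:
            continue
        # Check 2: discard reads that are too short.
        if len(seq) < min_length:
            continue
        # Check 3: keep the read only if a known barcode is found;
        # reads matching no barcode fall through and are dropped.
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                by_sample[sample].append((read_id, seq[len(barcode):]))
                break
    return dict(by_sample)
```

With 75 cycles at roughly 2.5 bp per cycle, a caller might pass a `min_length` of about 188 bp, although the exact conversion depends on sequence composition.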

We note that the clustering criteria we use (<2% divergence,
including insertions and deletions) are more conservative
than commonly used definitions of species-level
OTUs (<2% divergence excluding indels). We used
conservative clustering because no universal cutoff ap- 
plies to all organisms [51] and in order to avoid merging 
together organisms with potentially different phenotypes 
(for example, closely-related strains, see Additional file 3: 
Figure S4 for an example in closely-related Escherichia/ 
Shigella OTUs). Similar considerations have led to the 
development of specialized software for the analysis of va- 
ginal 16S rRNA gene survey data [52]. Our approach pro- 
vides a good tradeoff between mitigating the effect of 
errors and allowing an unbiased analysis of the data. Fur- 
thermore, an exploration of increasingly permissive clus- 
tering thresholds reveals that our conservative clustering 
strategy does not lose statistical power (see Additional file 3: 
Figures S5, S6). 

Chimera checking was performed with Uchime 4.2.40 
[53]. 

PICRUST analysis 

The most abundant 6879 OTUs were reprocessed using 
QIIME [7] version 1.8.0-dev as recommended on the 
PICRUST website (specifically OTUs were constructed 
with the pick_closed_reference_otus.py script against the 
latest version (version 13.5) of the Greengenes [54] data- 
base) and the resulting information was processed with 
PICRUST [16] version 1.0.0-dev using the KEGG ana- 
lysis module and aggregating the results to level 3. The 
results were further explored with STAMP [17] version 
2.0.2, using the two-group analysis module, focusing on 
known aerobic and anaerobic pathways. 

Data normalization 

In order to avoid the bias that may be introduced by preferential
amplification or sequencing of specific sequences,
we scaled the counts by the 56th percentile of the number
of OTUs in each sample. The 56th percentile was empirically
determined from the distribution of non-zero counts
required to behave consistently across our samples. We
normalized with a Cumulative Sum Scaling approach,
which scales counts by dividing by the sum of each sample's
counts up to and including the pth quantile (that is, for all
samples j, $s_j^p = \sum_{i \mid c_{ij} \le q_j^p} c_{ij}$, where $c_{ij}$ is the count of OTU i
in sample j and $q_j^p$ is the pth quantile of sample j).
Normalized counts are then given by $\frac{c_{ij}}{s_j^p} \cdot 1000$.

This method constrains communities with respect to a 
total size, but does not place undue influence on features 
(OTUs) that are preferentially sampled. A full description 
of the methodology is provided in Paulson et al. [18]. 
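
As an illustration of Cumulative Sum Scaling for a single sample, the following sketch computes the scaling factor from the non-zero counts and rescales to 1,000. It is not the metagenomeSeq implementation: the function name is hypothetical, and the nearest-rank definition of the quantile is an assumption chosen for simplicity.

```python
import math


def css_normalize(counts, p=0.56, scale=1000.0):
    """Cumulative Sum Scaling for one sample's OTU counts.

    counts: list of non-negative counts, one per OTU.
    p:      quantile used for the scaling factor (0.56 in the text).
    scale:  constant multiplier applied after scaling (1000 in the text).
    """
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return [0.0 for _ in counts]
    # Assumption: nearest-rank quantile of the non-zero counts.
    rank = max(1, math.ceil(p * len(nonzero)))
    quantile = nonzero[rank - 1]
    # Scaling factor: sum of all counts at or below the quantile.
    scaling_factor = sum(c for c in nonzero if c <= quantile)
    return [c / scaling_factor * scale for c in counts]
```

Because the scaling factor ignores counts above the quantile, a single very abundant OTU changes the normalized values of the other OTUs far less than it would under total-sum scaling.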

Statistical approaches 

To test for presence and absence of an organism we performed
Fisher's exact test, stratifying by positive and negative
samples. A sample was classified as positive for an organism
if it contained one or more sequences of that organism
and as negative if it contained none.
The totals were calculated for each taxon; a
minimum of 20 positive samples was required for a statistical
test to be attempted. To correct for multiple comparisons
we controlled the expected proportion of false
positives following Benjamini and Hochberg [55].
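
The presence/absence test and the multiple-testing correction can be sketched in plain Python as shown below. The function names are hypothetical, the two-sided form of the test is an assumption (the text does not state the alternative), and the study itself was carried out in R rather than with this code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are cases/controls; columns are positive/negative samples.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x positives among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def presence_test(case_counts, control_counts, min_positive=20):
    """Return the P value, or None when too few samples are positive."""
    case_pos = sum(1 for n in case_counts if n >= 1)
    ctrl_pos = sum(1 for n in control_counts if n >= 1)
    if case_pos + ctrl_pos < min_positive:
        return None
    return fisher_exact_two_sided(
        case_pos, len(case_counts) - case_pos,
        ctrl_pos, len(control_counts) - ctrl_pos,
    )


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, `presence_test` would be called once per taxon, the `None` results dropped, and the remaining P values passed together to `benjamini_hochberg`.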

Differential abundance was assessed with the package
metagenomeSeq [18], a statistical approach that models
confounding factors such as age and country, and also the effect
of undersampling on the observed counts. Significant
findings were reported for OTUs that satisfied the fol- 
lowing criteria: (1) OTU was abundant (>12 normalized 
counts per sample) in cases or controls; (2) OTU was 
prevalent (present in >10 cases and controls); (3) OTU 
had fold change or odds ratio exceeding 2 in either cases 
or controls; and (4) statistical association was significant 
(P <0.05) after Benjamini-Hochberg correction for mul- 
tiple testing. 

Analyses were performed using R version 3.0.1
with the packages vegan 2.0-7 and metagenomeSeq
1.2.21.

Correlation network construction 

Correlation networks were constructed separately on 
cases and controls to characterize the dependencies bet- 
ween 268 differentially abundant OTUs (Additional file 4: 
Table S3). 

Each network was built using SparCC [56], a tool spe- 
cifically developed for assessing the correlation structure 
within microbial communities. The statistical significance 
for each OTU-OTU interaction was obtained with an empirical
null distribution using 1,000 bootstrap iterations.
The P values were further adjusted for multiple comparisons
using the Benjamini and Hochberg [55] correction.
All OTU-OTU interactions with FDR ≤ 0.05 were considered
significant and were represented as edges in the
network.

For simplicity of visual representation, OTUs were ag- 
gregated at genus or lower taxonomic levels using the 
median normalized abundance of the aggregated OTUs 
as the abundance of the corresponding taxonomic group. 
We omitted all taxonomic groups with median abundance 
lower than 500 normalized counts, as well as all edges 
with SparCC correlation lower than 0.09. The plots were 
drawn in Cytoscape 3.0.1 [57]. 
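
The aggregation and thresholding used to simplify the network plots can be sketched as follows. Several details are assumptions made for the example and are not stated in the text: each OTU is represented by a single summary abundance, edges are supplied already at the level of taxonomic groups, and the 0.09 cutoff is applied to the absolute value of the correlation. All names are hypothetical.

```python
from statistics import median


def aggregate_by_group(otu_abundance, otu_to_group):
    """Median normalized abundance of the OTUs assigned to each group."""
    grouped = {}
    for otu, abundance in otu_abundance.items():
        grouped.setdefault(otu_to_group[otu], []).append(abundance)
    return {group: median(values) for group, values in grouped.items()}


def filter_network(group_abundance, edges,
                   min_abundance=500, min_corr=0.09, max_fdr=0.05):
    """Keep abundant groups and significant, sufficiently strong edges.

    edges: iterable of (group_u, group_v, correlation, fdr) tuples.
    """
    nodes = {g for g, a in group_abundance.items() if a >= min_abundance}
    kept = [
        (u, v, corr)
        for u, v, corr, fdr in edges
        # Assumption: the correlation cutoff applies to |corr|.
        if u in nodes and v in nodes and fdr <= max_fdr and abs(corr) >= min_corr
    ]
    return nodes, kept
```

The resulting node set and edge list could then be exported to whatever format the plotting tool expects.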

Additional files 